{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n\nMatrix Multiply Blocking\n========================\n**Author**: `Thierry Moreau <https://homes.cs.washington.edu/~moreau/>`_\n\nThis tutorial provides an overview on how to use TVM to map matrix\nmultiplication efficiently on the VTA design.\nWe recommend covering the `basic-mat-mult` tutorial first.\n\nIn this tutorial, we will demonstrate TVM schedule optimizations to break large\nneural network operators down onto smaller blocks to achieve computation within\nlimited hardware accelerator resources.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "RPC Setup\n---------\nWe start by programming the Pynq's FPGA and building its RPC runtime.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from __future__ import absolute_import, print_function\n\nimport os\nimport tvm\nfrom tvm import te\nimport vta\nimport numpy as np\nfrom tvm import rpc\nfrom tvm.contrib import utils\nfrom vta.testing import simulator\n\n# Load VTA parameters from the 3rdparty/vta-hw/config/vta_config.json file\nenv = vta.get_env()\n\n# We read the Pynq RPC host IP address and port number from the OS environment\nhost = os.environ.get(\"VTA_RPC_HOST\", \"192.168.2.99\")\nport = int(os.environ.get(\"VTA_RPC_PORT\", \"9091\"))\n\n# We configure both the bitstream and the runtime system on the Pynq\n# to match the VTA configuration specified by the vta_config.json file.\nif env.TARGET == \"pynq\":\n\n    # Make sure that TVM was compiled with RPC=1\n    assert tvm.runtime.enabled(\"rpc\")\n    remote = rpc.connect(host, port)\n\n    # Reconfigure the JIT runtime\n    vta.reconfig_runtime(remote)\n\n    # Program the FPGA with a pre-compiled VTA bitstream.\n    # You can program the FPGA with your own custom bitstream\n    # by passing the path to the bitstream file instead of None.\n    vta.program_fpga(remote, bitstream=None)\n\n# In simulation mode, host the RPC server locally.\nelif env.TARGET in [\"sim\", \"tsim\"]:\n    remote = rpc.LocalSession()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Computation Declaration\n-----------------------\nAs a first step, we need to describe our matrix multiplication computation.\nWe define the matrix multiplication as the computation one would find in a\nfully connected layer, defined by its batch size, input channels, and output\nchannels.\nThese have to be integer multiples of the VTA tensor shape:\n:code:`BATCH`, :code:`BLOCK_IN`, and :code:`BLOCK_OUT` respectively.\n\nWe've added extra operators to the matrix multiplication that apply\nshifting and clipping to the output in order to mimic a fixed-point\nmatrix multiplication followed by a rectified linear activation.\nWe describe the TVM dataflow graph of the fully connected layer below:\n\n![](https://raw.githubusercontent.com/uwsampl/web-data/main/vta/tutorial/fc_dataflow.png)\n\n     :align: center\n\nThis computation is intentionally too large to fit onto VTA's on-chip\nbuffers all at once. Therefore in the scheduling phase we'll\nrely on computation blocking strategies to break the computation down into\nmanageable chunks.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Fully connected layer dimensions: 1024 x 1024\nbatch_size = 1\nin_channels = 1024\nout_channels = 1024\nassert batch_size % env.BATCH == 0\nassert in_channels % env.BLOCK_IN == 0\nassert out_channels % env.BLOCK_OUT == 0\n\n# Let's derive the tiled input tensor shapes\ndata_shape = (batch_size // env.BATCH, in_channels // env.BLOCK_IN, env.BATCH, env.BLOCK_IN)\nweight_shape = (\n    out_channels // env.BLOCK_OUT,\n    in_channels // env.BLOCK_IN,\n    env.BLOCK_OUT,\n    env.BLOCK_IN,\n)\noutput_shape = (batch_size // env.BATCH, out_channels // env.BLOCK_OUT, env.BATCH, env.BLOCK_OUT)\nnum_ops = in_channels * out_channels * batch_size * 2\n\n# Reduction axes\nic = te.reduce_axis((0, in_channels // env.BLOCK_IN), name=\"ic\")\nic_tns = te.reduce_axis((0, env.BLOCK_IN), name=\"ic_tns\")\n\n# Input placeholder tensors\ndata = te.placeholder(data_shape, name=\"data\", dtype=env.inp_dtype)\nweight = te.placeholder(weight_shape, name=\"weight\", dtype=env.wgt_dtype)\n\n# Copy buffers\ndata_buf = te.compute(data_shape, lambda *i: data(*i), \"data_buf\")\nweight_buf = te.compute(weight_shape, lambda *i: weight(*i), \"weight_buf\")\n\n# Declare matrix multiply computation\nres_gemm = te.compute(\n    output_shape,\n    lambda bo, co, bi, ci: te.sum(\n        data_buf[bo, ic, bi, ic_tns].astype(env.acc_dtype)\n        * weight_buf[co, ic, ci, ic_tns].astype(env.acc_dtype),\n        axis=[ic, ic_tns],\n    ),\n    name=\"res_gem\",\n)\n\n# Add shift stage for fix-point normalization\nres_shr = te.compute(output_shape, lambda *i: res_gemm(*i) >> env.INP_WIDTH, name=\"res_shr\")\n\n# Apply clipping between (0, input max value)\ninp_max = (1 << (env.INP_WIDTH - 1)) - 1\nres_max = te.compute(output_shape, lambda *i: tvm.te.max(res_shr(*i), 0), \"res_max\")\nres_min = te.compute(output_shape, lambda *i: tvm.te.min(res_max(*i), inp_max), \"res_min\")\n\n# Apply typecast to input data type before sending results back\nres = te.compute(output_shape, lambda *i: res_min(*i).astype(env.inp_dtype), name=\"res\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Scheduling the Computation\n--------------------------\nWe'll look at a set of schedule transformations necessary to map the\nmatrix multiplications onto VTA in an efficient fashion.\nThose include:\n\n- Computation blocking\n- Lowering to VTA hardware intrinsics\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Create TVM schedule\ns = te.create_schedule(res.op)\n# Let's look at the default TVM schedule\nprint(tvm.lower(s, [data, weight, res], simple_mode=True))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Blocking the Computation\n~~~~~~~~~~~~~~~~~~~~~~~~\nThe matrix multiplication is by default too large for activations or weights\nto fit on VTA's on-chip buffers all at once.\nWe block the (1, 1024) by (1024, 1024) matrix multiplication into\nsmaller (1, 256) by (256, 256) matrix multiplications so the intermediate\ntensors can fit on the accelerator's on-chip SRAM.\nThis approach is similar to blocking techniques applied to CPUs and GPUs in\norder to increase cache hit rate.\n\nWe perform blocking along each axes (the batch axis being untouched since\nwe are performing singe-batch inference).\nWe also leave the inner-most tensorization axes as-is in order to allow\nTVM to pattern-match tensorization.\nWe show the outcome of blocking on the computation schedule in the diagram\nbelow:\n\n![](https://raw.githubusercontent.com/uwsampl/web-data/main/vta/tutorial/blocking.png)\n\n     :align: center\n     :width: 480px\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>The code after loop splitting and reordering is equivalent to the following\n  pseudo-code. We ignore the batch axis since we are only performing single-batch\n  inference in this example:\n\n  .. code-block:: c\n\n     for (int oc_out = 0; oc_out < 4; ++oc_out) {\n       // Initialization loop\n       for (int oc_inn = 0; oc_inn < 16; ++oc_inn) {\n        for (int oc_tns = 0; oc_tns < 16; ++oc_tns) {\n         int j = (oc_out * 16 + oc_inn) * 16 + oc_tns;\n         C[0][j] = 0;\n        }\n       }\n       for (int ic_out = 0; ic_out < 4; ++ic_out) {\n        // Block loop\n        for (int oc_inn = 0; oc_inn < 16; ++oc_inn) {\n         for (int ic_inn = 0; ic_inn < 16; ++ic_inn) {\n          // Tensorization loop\n          for (int oc_tns = 0; oc_tns < 16; ++oc_tns) {\n           for (int ic_tns = 0; ic_tns < 16; ++ic_tns) {\n            int i = (ic_out * 16 + ic_inn) * 16 + ic_tns;\n            int j = (oc_out * 16 + oc_inn) * 16 + oc_tns;\n            C[0][i] = C[0][i] + A[0][i] * B[j][i];\n           }\n          }\n         }\n        }\n       }\n      }\n     }</p></div>\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Let's define tiling sizes (expressed in multiples of VTA tensor shape size)\nb_block = 1 // env.BATCH\ni_block = 256 // env.BLOCK_IN\no_block = 256 // env.BLOCK_OUT\n\n# Tile the output tensor along the batch and output channel dimensions\n# (since by default we are doing single batch inference, the split along\n#  the batch dimension has no effect)\nb, oc, b_tns, oc_tns = s[res].op.axis\nb_out, b_inn = s[res].split(b, b_block)\noc_out, oc_inn = s[res].split(oc, o_block)\ns[res].reorder(b_out, oc_out, b_inn, oc_inn)\n\n# Move intermediate computation into each output compute tile\ns[res_gemm].compute_at(s[res], oc_out)\ns[res_shr].compute_at(s[res], oc_out)\ns[res_max].compute_at(s[res], oc_out)\ns[res_min].compute_at(s[res], oc_out)\n\n# Apply additional loop split along reduction axis (input channel)\nb_inn, oc_inn, b_tns, oc_tns = s[res_gemm].op.axis\nic_out, ic_inn = s[res_gemm].split(ic, i_block)\n\n# Reorder axes. We move the ic_out axis all the way out of the GEMM\n# loop to block along the reduction axis\ns[res_gemm].reorder(ic_out, b_inn, oc_inn, ic_inn, b_tns, oc_tns, ic_tns)\n\n# Let's look at the current TVM schedule after blocking\nprint(tvm.lower(s, [data, weight, res], simple_mode=True))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Lowering Copies to DMA Transfers\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nNext we set the buffer scopes to the corresponding on-chip VTA SRAM buffers.\nWe move the load loops into the matrix multiply computation loop to stage\nmemory loads such that they fit in the on-chip SRAM buffers.\nFinally we annotate the load/store loop outer axes with the DMA copy pragma\nto perform bulk memory transfers on VTA.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Set scope of SRAM buffers\ns[data_buf].set_scope(env.inp_scope)\ns[weight_buf].set_scope(env.wgt_scope)\ns[res_gemm].set_scope(env.acc_scope)\ns[res_shr].set_scope(env.acc_scope)\ns[res_min].set_scope(env.acc_scope)\ns[res_max].set_scope(env.acc_scope)\n\n# Block data and weight cache reads\ns[data_buf].compute_at(s[res_gemm], ic_out)\ns[weight_buf].compute_at(s[res_gemm], ic_out)\n\n# Use DMA copy pragma on DRAM->SRAM operations\ns[data_buf].pragma(s[data_buf].op.axis[0], env.dma_copy)\ns[weight_buf].pragma(s[weight_buf].op.axis[0], env.dma_copy)\n\n# Use DMA copy pragma on SRAM->DRAM operation\n# (this implies that these copies should be performed along b_inn,\n# or result axis 2)\ns[res].pragma(s[res].op.axis[2], env.dma_copy)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Lowering Computation to VTA Compute Intrinsics\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nThe last phase is to lower the computation loops down to VTA hardware\nintrinsics by mapping the matrix multiplication to tensor intrinsics,\nand mapping the shift, and clipping computation to the vector ALU.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Apply tensorization over the batch tensor tile axis\ns[res_gemm].tensorize(b_tns, env.gemm)\n\n# Add an ALU pragma over the shift and clipping operations\ns[res_shr].pragma(s[res_shr].op.axis[0], env.alu)\ns[res_min].pragma(s[res_min].op.axis[0], env.alu)\ns[res_max].pragma(s[res_max].op.axis[0], env.alu)\n\n# Let's look at the final lowered TVM schedule after lowering memory\n# loads/stores down to DMA copy intrinsics, and the computation down to\n# VTA compute intrinsics.\nprint(vta.lower(s, [data, weight, res], simple_mode=True))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "TVM Compilation and Verification\n--------------------------------\nAfter specifying the schedule, we can compile it into a TVM function.\nWe save the module so we can send it over RPC.\nWe run the function and verify it against a numpy implementation to\nensure correctness.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Compile the TVM module\nmy_gemm = vta.build(s, [data, weight, res], \"ext_dev\", env.target_host, name=\"my_gemm\")\ntemp = utils.tempdir()\nmy_gemm.save(temp.relpath(\"gemm.o\"))\nremote.upload(temp.relpath(\"gemm.o\"))\nf = remote.load_module(\"gemm.o\")\n\n# Get the remote device context\nctx = remote.ext_dev(0)\n\n# Initialize the data and weight arrays randomly in the int range of (-128, 128]\ndata_np = np.random.randint(-128, 128, size=(batch_size, in_channels)).astype(data.dtype)\nweight_np = np.random.randint(-128, 128, size=(out_channels, in_channels)).astype(weight.dtype)\n\n# Apply packing to the data and weight arrays from a 2D to a 4D packed layout\ndata_packed = data_np.reshape(\n    batch_size // env.BATCH, env.BATCH, in_channels // env.BLOCK_IN, env.BLOCK_IN\n).transpose((0, 2, 1, 3))\nweight_packed = weight_np.reshape(\n    out_channels // env.BLOCK_OUT, env.BLOCK_OUT, in_channels // env.BLOCK_IN, env.BLOCK_IN\n).transpose((0, 2, 1, 3))\n\n# Format the input/output arrays with tvm.nd.array to the DLPack standard\ndata_nd = tvm.nd.array(data_packed, ctx)\nweight_nd = tvm.nd.array(weight_packed, ctx)\nres_nd = tvm.nd.array(np.zeros(output_shape).astype(res.dtype), ctx)\n\n# Clear stats\nif env.TARGET in [\"sim\", \"tsim\"]:\n    simulator.clear_stats()\n\n# Invoke the module to perform the computation\nf(data_nd, weight_nd, res_nd)\n\n# Verify against numpy implementation\nres_ref = np.dot(data_np.astype(env.acc_dtype), weight_np.T.astype(env.acc_dtype))\nres_ref = res_ref >> env.INP_WIDTH\nres_ref = np.clip(res_ref, 0, inp_max)\nres_ref = res_ref.astype(res.dtype)\nres_ref = res_ref.reshape(\n    batch_size // env.BATCH, env.BATCH, out_channels // env.BLOCK_OUT, env.BLOCK_OUT\n).transpose((0, 2, 1, 3))\nnp.testing.assert_equal(res_ref, res_nd.numpy())\n\n# Print stats\nif env.TARGET in [\"sim\", \"tsim\"]:\n    sim_stats = simulator.stats()\n    print(\"Execution statistics:\")\n    for k, v in sim_stats.items():\n        print(\"\\t{:<16}: {:>16}\".format(k, v))\n\nprint(\"Successful blocked matrix multiply test!\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Summary\n-------\nThis tutorial demonstrates how TVM scheduling primitives can achieve\ncomputation blocking for a matrix multiplication example.\nThis allows us to map arbitrarily large computation onto limited\nhardware accelerator resources.\n\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}